This repository contains the scripts, tutorials, and templates for the Rich Lab’s mainstream bioinformatic workflows. You can find some of the current recommended tutorial versions of these workflows linked below.
MetadataSetup.htmlThe purpose of a metadata file is to organize the potential
independent or predictor variables for your analysis into a single table
with one row per SampleID. Then, when you produce a set of
potential dependent or outcome values, you organize those into a similar
structure with the SampleIDs organized rowwise to
streamline the process of matching predictor variables to outcome
variables by SampleID.
It’s good practice to keep as much of your information in one tidy table as possible so that you can keep pulling from that source to further filter, wrangle and analyze without losing track of different versions and datasets over time.
That means you should brainstorm as many possible predictor variables you might use in downstream analysis as possible and organize them into one tidy table where each SampleID is matched to a value for every variable. You will end up ignoring most of these variables as you construct individual tests and visuals later, so consider this simply a rough draft of your information of interest.
This tutorial uses the Pygmy Loris dataset as an example. You may have a very different set of variables to organize for your own project.
compilation_loris.tsv - produced by the
SampleInventory.Rmd scriptsamples_metadata.tsv - This file gets carried forward
on several other scripts in this repository to help you build consistent
predictor variables across datasets.microbiome_tidy_alignments.htmlwf-16s pipeline and convert it into fully
formatted, tidy tables that you can normalize and analyze using
microeco or a similar package.samples_metadata.tsv - produced by the
MetadataSetup.Rmd script*_abundance_table_species.tsv - produced by the
wf-16s pipeline*_wf-16s-report.html - produced by the
wf-16s pipeline.csv file per demultiplexed sample from this
.html file for a given sequencing run and merge those into
a table with raw alignment datataxonomy_table.tsv - This table is required for
microeco and several other metagenomic packages. It must
contain exactly one row per taxon in your dataset.otu_table.tsv - This table is required for
microeco and several other metagenomic packages. It must
contain exactly one row per taxon and one column per sample in your
dataset. The numeric values represent the raw count of reads mapped to a
given taxon within a given sample. We will want to normalize these
before analyzing/interpreting them.sample_table.tsv - This table is required for
microeco and several other metagenomic packages. It must
contain exactly one row per sample in your dataset. The columns contain
metadata values that we may use as independent/predictor variables for
interpreting our results.microbiome_references.htmlalignments.tsv - raw alignment data gathered from
the wf-16s output and organized into one table in the previous
scriptrepresentative_seqs.fasta - This contains one GenBank
reference sequence for each taxon in your dataset that we can use to
substitute for ASVs used in many approaches designed for short read 16S
data.representative_seqs_tax4fun.fasta - This is the same as
the previous file, except with different syntax to work with the
Tax4Fun2 package.tree.newick - This is a phylogenetic tree produced from
the previous fasta multiple sequence alignment.I created a secondary repository called workflows_first_use.
This is where I will store some scripts/tutorials with specific
instructions for setting up your working directory and environment for
specialized packages (especially for microeco).
params option in the R Markdown HeaderHere is what my default yaml header on most R Markdown
documents in this repository look like:
---
output:
html_document:
theme:
bslib: true
css: journal.css
toc: true
toc_float: true
df_print: paged
params:
sampleset: "loris"
seqrun: "hdz18"
---
I use the params option for streamlining some automation
across different scripts. You can use the sampleset setting under params
in the header of this script to select which sampleset you will be
working with. Everywhere in a chunk of code where I have written
params$sampleset will be replaced with whichever strong you
change your params value to in the header.
config PackageI also use config for streamlining and organization. If
you want to do the same, you should create your own
config.yml file in your project working directory. You can
expand the content below to see what my example config.yml
file looks like as well as the R script I use to integrate this with the
params values for consistent file sourcing and writing
across datasets.
default:
setup: "setup/global/setup.R"
conflicts: "setup/conflicted.R"
functions: "setup/global/functions.R"
packages: "setup/global/packages.R"
inputs: "setup/default_inputs.R"
fonts: "setup/global/fonts/FA6_Free-Solid-900.otf"
test_prediction: "setup/microbiome/test_prediction/"
visuals: "visuals/"
tmp_tsv: "tmp/tmp_table.tsv"
tmp_downloads: "tmp/downloads/"
tmp_fetch: "tmp/fetch_references.txt"
tmp_fasta3: "tmp/tmp3.fasta"
tmp_fasta4: "tmp/tmp4.fasta"
git_url: "https://rich-molecular-health-lab.github.io/"
isolates:
samplesets: "salci"
minQC: 10
microbiome:
micro_scripts: !expr list("setup/microbiome/functions.R", "setup/microbiome/inputs.R")
setup_dir: "setup/microbiome"
inputs: "setup/microbiome/inputs.R"
packages: "setup/microbiome/packages.R"
first_use_packages: "setup/microbiome/packages_first_use.R"
functions: "setup/microbiome/functions.R"
tax4fun: "setup/microbiome/Tax4Fun2_ReferenceData_v2/"
Ref99NR: "setup/microbiome/Tax4Fun2_ReferenceData_v2/Ref99NR/"
blast: "setup/microbiome/ncbi-blast-2.16.0+/bin"
swan:
functions: "setup/read_processing/functions.R"
loris_tax: "/work/richlab/aliciarich/bioinformatics_stats/data/loris/taxonomy/"
uid_file: "/work/richlab/aliciarich/bioinformatics_stats/tmp/fetch_references.txt"
logs: "/work/richlab/aliciarich/bioinformatics_stats/logs"
ont_reads: "/work/richlab/aliciarich/ont_reads/"
dorado_model: "/work/richlab/aliciarich/ont_reads/dna_r10.4.1_e8.2_400bps_sup@v5.0.0/"
bioinformatics_stats: "/work/richlab/aliciarich/bioinformatics_stats"
samplesheets: "/work/richlab/aliciarich/bioinformatics_stats/dataframes/sample_sheet/loris"
scripts: "/work/richlab/aliciarich/bioinformatics_stats/batch_scripts/"
accessions: "/work/richlab/aliciarich/bioinformatics_stats/tmp/accessions.txt"
tmp_fasta1: "/work/richlab/aliciarich/bioinformatics_stats/tmp/tmp1.fasta"
tmp_fasta2: "/work/richlab/aliciarich/bioinformatics_stats/tmp/tmp2.fasta"
tmp_fasta3: "/work/richlab/aliciarich/bioinformatics_stats/tmp/tmp3.fasta"
tmp_fasta4: "/work/richlab/aliciarich/bioinformatics_stats/tmp/tmp4.fasta"
raw_loris_mb: "/work/richlab/aliciarich/ont_reads/loris_microbiome/hdz_raw"
basecalled_loris_mb: "/work/richlab/aliciarich/ont_reads/loris_microbiome/basecalled"
trimmed_loris_mb: "/work/richlab/aliciarich/ont_reads/loris_microbiome/trimmed"
filtered_loris_mb: "/work/richlab/aliciarich/ont_reads/loris_microbiome/filtered"
loris_mb_aligned: "/work/richlab/aliciarich/bioinformatics_stats/data/loris/taxonomy/refseqs_aligned.fasta"
loris_mb_tree: "/work/richlab/aliciarich/bioinformatics_stats/data/loris/taxonomy/refseqs_tree.newick"
bats:
sequencing:
coverage: "visuals/bats_16s_depth_summary.html"
depth_plot: "visuals/bats_16s_depth_hist.html"
metadata:
summary: "metadata/bats/samples_metadata.tsv"
key: "metadata/bats/metadata_key.R"
factors: "metadata/bats/factors.R"
scripts: !expr list("metadata/bats/colors.R", "metadata/bats/metadata_key.R", "metadata/bats/factors.R")
inventories:
all_stages: "../read_processing/samples/inventories/bats/compilation_bats.tsv"
collection: "../read_processing/samples/inventories/bats/samples_bats.csv"
extraction: "../read_processing/samples/inventories/bats/extracts_bats.csv"
libraries: "../read_processing/samples/inventories/bats/libraries_bats.csv"
seqruns: "../read_processing/samples/inventories/bats/seqruns_bats.csv"
barcodes_output: "../read_processing/samples/barcode_alignments/bats/"
read_alignments: "data/bats/read_alignments"
taxa_reps:
aligned: "data/bats/taxonomy/refseqs_aligned.fasta"
tree: "data/bats/taxonomy/refseqs_tree.newick"
table: "data/bats/taxonomy/tax_table.tsv"
loris:
day1: "2023-10-26"
last: "2024-10-25"
visuals:
demographics: "visuals/loris/demography"
coverage: "visuals/loris_depth_summary.html"
depth_plot: "visuals/loris_depth_hist.html"
AZAstudbooks:
btp: "metadata/loris/Studbook/BTP25.R"
btp_current: "metadata/loris/Studbook/BTP25.tsv"
lifetable: "metadata/loris/Studbook/lifetable25.tsv"
lifetabStatic: "metadata/loris/Studbook/lifetable25_static.tsv"
load_data: "metadata/loris/Studbook/LoadStudbookData.R"
functions: "metadata/loris/studbookFunctions.R"
reactables: "metadata/loris/Studbook/reactables.R"
living25: "metadata/loris/Studbook/LivingPop_2025.csv"
living21: "metadata/loris/Studbook/LivingPop_2021.csv"
historic21: "metadata/loris/Studbook/HistoricPop_2021.csv"
institutions:
current25: "metadata/loris/Studbook/CurrentInstitutions_2025.csv"
current21: "metadata/loris/Studbook/CurrentInstitutions_2021.csv"
historic21: "metadata/loris/Studbook/HistoricInstitutions_2021.csv"
working: "metadata/loris/Studbook/working_studbook_loris25.tsv"
reactable_ready: "metadata/loris/Studbook/studbook_loris_reactableReady.tsv"
timeline: "metadata/loris/Studbook/working_timeline_loris25.tsv"
metadata:
scripts: !expr list("metadata/loris/colors.R", "metadata/loris/metadata_key.R", "metadata/loris/diet_schedule.R")
bristol: "metadata/loris/bristols.tsv"
studbook: "metadata/loris/subjects_loris.csv"
summary: "metadata/loris/samples_metadata.tsv"
key: "metadata/loris/metadata_key.R"
factors: "metadata/loris/factors.R"
foods: "metadata/loris/foods.tsv"
proteins: "metadata/loris/proteins.tsv"
fats: "metadata/loris/fats.tsv"
CHOs: "metadata/loris/CHOs.tsv"
Ash: "metadata/loris/Ash.tsv"
vitamins: "metadata/loris/vitamins.tsv"
reactable: "metadata/loris/loris_metadata_summary.html"
sample_table:
identifiers: "metadata/loris/identifier_key.tsv"
main: "metadata/loris/sample_table.tsv"
merged: "metadata/loris/sample_table_merged.tsv"
inventories:
all_stages: "../read_processing/samples/inventories/loris/compilation_loris.tsv"
collection: "../read_processing/samples/inventories/loris/samples_loris.csv"
extraction: "../read_processing/samples/inventories/loris/extracts_loris.csv"
libraries: "../read_processing/samples/inventories/loris/libraries_loris.csv"
seqruns: "../read_processing/samples/inventories/loris/seqruns_loris.csv"
outputs_wf16s: "data/loris/outputs_wf16s/"
barcodes_output: "dataframes/barcodes/loris/"
read_alignments: "data/loris/read_alignments"
taxa_reps:
aligned: "data/loris/taxonomy/refseqs_aligned.fasta"
tree: "data/loris/taxonomy/refseqs_tree.newick"
table: "data/loris/taxonomy/tax_table.tsv"
abundance_wf16s: "data/loris/wf16s_abundance/"
microeco:
dataset:
main:
keg: "microeco/loris/datasets/main/keg"
njc: "microeco/loris/datasets/main/njc"
fpt: "microeco/loris/datasets/main/fpt"
tax: "microeco/loris/datasets/main"
culi:
keg: "microeco/loris/datasets/culi/keg"
njc: "microeco/loris/datasets/culi/njc"
fpt: "microeco/loris/datasets/culi/fpt"
tax: "microeco/loris/datasets/culi"
warb:
keg: "microeco/loris/datasets/warble/keg"
njc: "microeco/loris/datasets/warble/njc"
fpt: "microeco/loris/datasets/warble/fpt"
tax: "microeco/loris/datasets/warble"
abund:
main:
keg: "microeco/loris/abundance/main/keg"
fpt: "microeco/loris/abundance/main/fpt"
njc: "microeco/loris/abundance/main/njc"
tax: "microeco/loris/abundance/main"
culi:
keg: "microeco/loris/abundance/culi/keg"
fpt: "microeco/loris/abundance/culi/fpt"
njc: "microeco/loris/abundance/culi/njc"
tax: "microeco/loris/abundance/culi"
warb:
keg: "microeco/loris/abundance/warble/keg"
fpt: "microeco/loris/abundance/warble/fpt"
njc: "microeco/loris/abundance/warble/njc"
tax: "microeco/loris/abundance/warble"
alpha:
main: "microeco/loris/alphadiversity/main"
culi: "microeco/loris/alphadiversity/culi"
warb: "microeco/loris/alphadiversity/warble"
beta:
main:
kegg: "microeco/loris/betadiversity/main/keg"
fpt: "microeco/loris/betadiversity/main/fpt"
njc: "microeco/loris/betadiversity/main/njc"
tax: "microeco/loris/betadiversity/main"
culi:
kegg: "microeco/loris/betadiversity/culi/keg"
fpt: "microeco/loris/betadiversity/culi/fpt"
njc: "microeco/loris/betadiversity/culi/njc"
tax: "microeco/loris/betadiversity/culi"
warb:
kegg: "microeco/loris/betadiversity/warble/keg"
fpt: "microeco/loris/betadiversity/warble/fpt"
njc: "microeco/loris/betadiversity/warble/njc"
tax: "microeco/loris/betadiversity/warble"
data:
main:
feature: "microeco/loris/datasets/main/feature_table.tsv"
tree: "microeco/loris/datasets/main/phylo_tree.tre"
fasta: "microeco/loris/datasets/main/rep_fasta.fasta"
samples: "microeco/loris/datasets/main/sample_table.tsv"
taxa: "microeco/loris/datasets/main/tax_table.tsv"
culi:
feature: "microeco/loris/datasets/culi/feature_table.tsv"
tree: "microeco/loris/datasets/culi/phylo_tree.tre"
fasta: "microeco/loris/datasets/culi/rep_fasta.fasta"
samples: "microeco/loris/datasets/culi/sample_table.tsv"
taxa: "microeco/loris/datasets/culi/tax_table.tsv"
warb:
feature: "microeco/loris/datasets/warb/feature_table.tsv"
tree: "microeco/loris/datasets/warb/phylo_tree.tre"
fasta: "microeco/loris/datasets/warb/rep_fasta.fasta"
samples: "microeco/loris/datasets/warb/sample_table.tsv"
taxa: "microeco/loris/datasets/warb/tax_table.tsv"
sample_sheets:
compilations:
bats: "../read_processing/samples/sample_sheets/bats/nwr_combined_sample_sheet.csv"
loris: "../read_processing/samples/sample_sheets/loris/hdz_combined_sample_sheet.csv"
nwr1: "../read_processing/samples/sample_sheets/bats/nwr1_sample_sheet.csv"
hdz1: "../read_processing/samples/sample_sheets/loris/hdz1_sample_sheet.csv"
hdz2: "../read_processing/samples/sample_sheets/loris/hdz2_sample_sheet.csv"
hdz3: "../read_processing/samples/sample_sheets/loris/hdz3_sample_sheet.csv"
hdz4: "../read_processing/samples/sample_sheets/loris/hdz4_sample_sheet.csv"
hdz5: "../read_processing/samples/sample_sheets/loris/hdz5_sample_sheet.csv"
hdz6: "../read_processing/samples/sample_sheets/loris/hdz6_sample_sheet.csv"
hdz7: "../read_processing/samples/sample_sheets/loris/hdz7_sample_sheet.csv"
hdz8: "../read_processing/samples/sample_sheets/loris/hdz8_sample_sheet.csv"
hdz9: "../read_processing/samples/sample_sheets/loris/hdz9_sample_sheet.csv"
hdz10: "../read_processing/samples/sample_sheets/loris/hdz10_sample_sheet.csv"
hdz11: "../read_processing/samples/sample_sheets/loris/hdz11_sample_sheet.csv"
hdz12: "../read_processing/samples/sample_sheets/loris/hdz12_sample_sheet.csv"
hdz13: "../read_processing/samples/sample_sheets/loris/hdz13_sample_sheet.csv"
hdz14: "../read_processing/samples/sample_sheets/loris/hdz14_sample_sheet.csv"
hdz15: "../read_processing/samples/sample_sheets/loris/hdz15_sample_sheet.csv"
hdz16: "../read_processing/samples/sample_sheets/loris/hdz16_sample_sheet.csv"
hdz17: "../read_processing/samples/sample_sheets/loris/hdz17_sample_sheet.csv"
hdz18: "../read_processing/samples/sample_sheets/loris/hdz18_sample_sheet.csv"
barcode_alignments:
compilations:
loris: "../read_processing/samples/barcode_alignments/loris/hdz_combined_barcode_alignment.tsv"
bats: "../read_processing/samples/barcode_alignments/bats/nwr_combined_barcode_alignment.tsv"
nwr1: "../read_processing/samples/barcode_alignments/bats/nwr1_barcode_alignment.tsv"
hdz1: "../read_processing/samples/barcode_alignments/loris/hdz1_barcode_alignment.tsv"
hdz2: "../read_processing/samples/barcode_alignments/loris/hdz2_barcode_alignment.tsv"
hdz3: "../read_processing/samples/barcode_alignments/loris/hdz3_barcode_alignment.tsv"
hdz4: "../read_processing/samples/barcode_alignments/loris/hdz4_barcode_alignment.tsv"
hdz5: "../read_processing/samples/barcode_alignments/loris/hdz5_barcode_alignment.tsv"
hdz6: "../read_processing/samples/barcode_alignments/loris/hdz6_barcode_alignment.tsv"
hdz7: "../read_processing/samples/barcode_alignments/loris/hdz7_barcode_alignment.tsv"
hdz8: "../read_processing/samples/barcode_alignments/loris/hdz8_barcode_alignment.tsv"
hdz9: "../read_processing/samples/barcode_alignments/loris/hdz9_barcode_alignment.tsv"
hdz10: "../read_processing/samples/barcode_alignments/loris/hdz10_barcode_alignment.tsv"
hdz11: "../read_processing/samples/barcode_alignments/loris/hdz11_barcode_alignment.tsv"
hdz12: "../read_processing/samples/barcode_alignments/loris/hdz12_barcode_alignment.tsv"
hdz13: "../read_processing/samples/barcode_alignments/loris/hdz13_barcode_alignment.tsv"
hdz14: "../read_processing/samples/barcode_alignments/loris/hdz14_barcode_alignment.tsv"
hdz15: "../read_processing/samples/barcode_alignments/loris/hdz15_barcode_alignment.tsv"
hdz16: "../read_processing/samples/barcode_alignments/loris/hdz16_barcode_alignment.tsv"
hdz17: "../read_processing/samples/barcode_alignments/loris/hdz17_barcode_alignment.tsv"
hdz18: "../read_processing/samples/barcode_alignments/loris/hdz18_barcode_alignment.tsv"
abund_wf16s_files:
hdz1: "data/loris/wf16s_abundance/hdz1_abundance_table_species.tsv"
hdz2: "data/loris/wf16s_abundance/hdz2_abundance_table_species.tsv"
hdz3: "data/loris/wf16s_abundance/hdz3_abundance_table_species.tsv"
hdz4: "data/loris/wf16s_abundance/hdz4_abundance_table_species.tsv"
hdz5: "data/loris/wf16s_abundance/hdz5_abundance_table_species.tsv"
hdz6: "data/loris/wf16s_abundance/hdz6_abundance_table_species.tsv"
hdz7: "data/loris/wf16s_abundance/hdz7_abundance_table_species.tsv"
hdz8: "data/loris/wf16s_abundance/hdz8_abundance_table_species.tsv"
hdz9: "data/loris/wf16s_abundance/hdz9_abundance_table_species.tsv"
hdz10: "data/loris/wf16s_abundance/hdz10_abundance_table_species.tsv"
hdz11: "data/loris/wf16s_abundance/hdz11_abundance_table_species.tsv"
hdz12: "data/loris/wf16s_abundance/hdz12_abundance_table_species.tsv"
hdz13: "data/loris/wf16s_abundance/hdz13_abundance_table_species.tsv"
hdz14: "data/loris/wf16s_abundance/hdz14_abundance_table_species.tsv"
hdz15: "data/loris/wf16s_abundance/hdz15_abundance_table_species.tsv"
hdz16: "data/loris/wf16s_abundance/hdz16_abundance_table_species.tsv"
hdz17: "data/loris/wf16s_abundance/hdz17_abundance_table_species.tsv"
hdz18: "data/loris/wf16s_abundance/hdz18_abundance_table_species.tsv"
methods_16s:
libprep_workflow: "'rapid16s'"
dorado_model: "'dna_r10.4.1_e8.2_400bps_sup@v5.0.0'"
min_length: 1000
max_length: 2000
min_qual: 7
min_id: 85
min_cov: 80
kit_name: "'SQK-16S114-24'"
tax_rank: "S"
n_taxa_barplot: 12
abund_threshold: 0
loris:
rarefy: 3500
norm: "SRS"
min_abund: 0.00001
min_freq: 0.01
include_lowest: TRUE
unifrac: TRUE
betadiv: "aitchison"
alpha_pd: TRUE
tax4fun_db: "Ref99NR"
loris_rarefy: 3500
keg_minID: 97
global <- config::get(config = "default")
swan <- config::get(config = "swan")
micro <- config::get(config = "microbiome")
loris <- config::get(config = "loris")
marmoset <- config::get(config = "marmoset")
isolates <- config::get(config = "isolates")
bats <- config::get(config = "bats")
methods_16s <- config::get(config = "methods_16s")
sample_sheets <- config::get(config = "sample_sheets")
abund_wf16s_files <- config::get(config = "abund_wf16s_files")
barcode_alignments<- config::get(config = "barcode_alignments")
seqruns <- seqruns %>% keep_at(params$sampleset) %>% list_flatten(name_spec = "")
subject_list <- keep_at(subjects, paste0(params$sampleset)) %>% list_flatten(name_spec = "{inner}")
path <- config::get(config = paste0(params$sampleset))
I use custom language engines in some scripts that I named “terminal” and “bash”. If you see a chunk of code with {terminal, warning = FALSE} written where you would usually see {r} at the top of the chunk, then running the chunk should only print that code as a text string in this document. This just makes it easier for me to copy and paste the code directly into the terminal panel that I use in my R Studio window when running code through a remote server instead of my local R console. There are ways to set R Studio up to run code through multiple servers, but I find this the simplest way to switch back and forth while still keeping a record of the code that have used or changes I have made to it.
You can expand the content below to see the code that I run at the start of many of these scripts to initiate those language engines.
knitr::knit_engines$set(terminal = function(options) {
code <- paste(options$code, collapse = "\n")
params <- map(params, ~ if (is.atomic(.)) {list(.)} else {(.)}) %>%
list_flatten()
patterns <- list(
params = list(
sampleset = paste0(params$sampleset),
seqrun = paste0(params$seqrun),
samplesheet = as.character(sample_sheets[paste0(tolower(params$seqrun))])
) ,
global = global ,
path = path ,
swan = swan ,
micro = micro ,
loris = loris ,
isolates = isolates ,
bats = bats ,
methods_16s = methods_16s ,
sample_sheets = sample_sheets ,
abund_wf16s_files = abund_wf16s_files ,
barcode_alignments= barcode_alignments
)
# Replace placeholders group by group
for (group in names(patterns)) {
placeholder_list <- patterns[[group]]
for (name in names(placeholder_list)) {
placeholder <- paste(group, name, sep = "\\$") # Match exact placeholder
value <- placeholder_list[[name]]
# Replace placeholders exactly and avoid breaking suffixes
code <- gsub(placeholder, value, code, perl = TRUE)
}
}
options$warning <- FALSE
knitr::engine_output(options, code, out = code)
})
knitr::opts_chunk$set(message = FALSE,
warning = FALSE,
echo = TRUE,
include = TRUE,
eval = TRUE,
comment = "")